Non-parametric Machine Learning Methods for Clustering and Variable Selection

Authors

  • Qian Liu
  • Eric Bair
  • Michael Kosorok
  • Andrew Nobel
  • Gary Slade
  • Donglin Zeng
Abstract

Qian Liu: Non-parametric Machine Learning Methods for Clustering and Variable Selection (under the direction of Eric Bair)

Non-parametric machine learning methods are popular and widely used across many scientific research areas, especially when dealing with high-dimension, low-sample-size (HDLSS) data. In particular, clustering and biclustering approaches can serve as exploratory analysis tools to uncover informative data structures, and random forest models have an advantage in coping with complex variable interactions. In many situations it is desirable to identify clusters that differ with respect to only a subset of features; such clusters may represent homogeneous subgroups of patients with a disease.

In this dissertation, we first propose a general framework for biclustering based on the sparse clustering method. Specifically, we develop an algorithm for identifying features that belong to biclusters. This framework can be used to identify biclusters that differ with respect to the means of the features, the variances of the features, or more general differences. We apply these methods to several simulated and real-world data sets, and the results of our methods compare favourably with previously published methods, with respect to both predictive accuracy and computing time.

As a follow-up to the biclustering study, we further examine the sparse clustering algorithm and point out a few limitations of its approach to tuning parameter selection. We propose an alternative approach for selecting the tuning parameter, and for better identifying the features with positive weights. We compare our algorithm with the existing sparse clustering method on both simulated and real-world data sets, and the results suggest that our method outperforms the existing method, especially in the presence of a weak clustering signal.

For the last project, we consider random forest variable importance (VIMP) scores.
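For readers unfamiliar with the sparse clustering machinery that the first two projects build on, the feature-weight update of Witten–Tibshirani-style sparse k-means can be sketched roughly as follows. This is a minimal illustration under assumed conventions, not the dissertation's algorithm; the function name and the L1 bound `s` are placeholders:

```python
import numpy as np
from sklearn.cluster import KMeans

def sparse_kmeans_weights(X, n_clusters, s):
    """One weight-update step of sparse k-means (illustrative sketch):
    cluster the data, compute each feature's between-cluster sum of
    squares (BCSS), then soft-threshold the BCSS vector so that the
    resulting weights satisfy ||w||_2 = 1 and ||w||_1 <= s."""
    labels = KMeans(n_clusters=n_clusters, n_init=10,
                    random_state=0).fit_predict(X)
    total = ((X - X.mean(axis=0)) ** 2).sum(axis=0)   # total SS per feature
    within = np.zeros(X.shape[1])
    for k in range(n_clusters):
        Xk = X[labels == k]
        within += ((Xk - Xk.mean(axis=0)) ** 2).sum(axis=0)
    bcss = total - within                             # between-cluster SS
    # Bisect on the soft-threshold level so that ||w||_1 hits the bound s.
    lo, hi = 0.0, bcss.max()
    for _ in range(50):
        delta = (lo + hi) / 2
        w = np.maximum(bcss - delta, 0.0)
        if w.any():
            w = w / np.linalg.norm(w)
        if w.sum() > s:
            lo = delta
        else:
            hi = delta
    return w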
We propose an alternative algorithm for calculating conditional VIMP scores. We test the proposed algorithm on both simulated and real-world data sets, and the results suggest that our conditional VIMP scores better reveal the association between predictor variables and the modelled outcome, despite correlation among the predictor variables.
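For contrast, the standard marginal permutation VIMP that conditional schemes refine can be sketched as below. This is a hedged illustration assuming scikit-learn, not the conditional algorithm proposed in the dissertation; the helper name is made up:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def permutation_vimp(model, X, y, rng):
    # Marginal permutation VIMP: the drop in R^2 after shuffling one
    # predictor column. Conditional variants instead permute within
    # strata of correlated covariates so that only the predictor's own
    # contribution is broken; that refinement is not shown here.
    base = model.score(X, y)
    drops = np.empty(X.shape[1])
    for j in range(X.shape[1]):
        Xp = X.copy()
        Xp[:, j] = rng.permutation(Xp[:, j])
        drops[j] = base - model.score(Xp, y)
    return drops
```

With correlated predictors, this marginal score can credit an uninformative variable simply because it tracks an informative one, which is the shortcoming that motivates conditional VIMP.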


Similar resources

Stock Price Prediction using Machine Learning and Swarm Intelligence

Background and Objectives: Stock price prediction has become one of the most interesting and challenging topics for researchers in the past few years. Due to the non-linear nature of the time-series data of stock prices, mathematical modeling approaches usually fail to yield acceptable results. Therefore, machine learning methods can be a promising solution to this problem. Methods: In this...


Diagnosis of Heart Disease Based on Meta Heuristic Algorithms and Clustering Methods

Data analysis in cardiovascular disease is difficult due to the massive amount of information involved. Not all features affect the final results, so it is important to identify the most effective ones. In this study, feature selection with the binary cuckoo optimization algorithm is implemented to reduce the number of features. According to the results, the most appropriate classification fo...


Non-parametric Bayesian Kernel Models

Kernel models for classification and regression have emerged as widely applied tools in statistics and machine learning. We discuss a Bayesian framework and theory for kernel methods, providing a new rationalisation of kernel regression based on non-parametric Bayesian models. Functional analytic results ensure that such a non-parametric prior specification induces a class of function...


Model-free Variable Selection in Reproducing Kernel Hilbert Space

Variable selection is popular in high-dimensional data analysis for identifying the truly informative variables. Many variable selection methods have been developed under various model assumptions. While success has been widely reported in the literature, their performance largely depends on the validity of the assumed models, such as linear or additive models. This article introduces a model-free va...


Application of Non Parametric Empirical Bayes Estimation to High Dimensional Classification

We consider the problem of classification using a high-dimensional feature space. In a paper by Bickel and Levina (2004), it is recommended to use naive-Bayes classifiers, that is, to treat the features as if they were statistically independent. Consider now a sparse setup, where only a few of the features are informative for classification. Fan and Fan (2008) suggested a variable selection and...




Publication date: 2014